02_practica1

Notebook 2: development

Missing values, outliers and correlations

This notebook studies and preprocesses the categorical, continuous and boolean variables, following this structure:
  1. Assigning the variable type

    • Data type conversion
  2. Stratified train/test split

  3. Descriptive overview of the data

  4. Distribution plots of the variables

  5. Treatment of continuous variables

    • Correlation plot
    • Treatment of null values
    • Imputation of null values
  6. Treatment of categorical and boolean variables

    • Treatment of null values
    • Imputation of null values

Import libraries

In [25]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import sklearn
from sklearn.impute import KNNImputer
import scipy.stats as ss
import warnings
from sklearn.model_selection import train_test_split

import sys
sys.path.append('/Users/miguelflores/Desktop/P1/practica1')
from funciones import funciones_auxiliares as f_aux

semilla = 42

pd.set_option("display.max_rows", 10000)
pd.set_option("display.max_columns", 10000)
pd.set_option("display.width", 10000)

Reading the data from the initial preprocessing

In [26]:
df = pd.read_csv('/Users/miguelflores/Desktop/CSV/pd_data_initial_preprocessing.csv').set_index('SK_ID_CURR')
df
Out[26]:
TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 
FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR NWEEKDAY_PROCESS_START
SK_ID_CURR
100002 1 Cash loans M 0 1 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.0220 0.0198 0.0 0.0000 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.0000 reg oper account block of flats 0.0149 Stone, brick 0 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 3
100003 0 Cash loans F 0 0 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.0790 0.0554 0.0 0.0000 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.0100 reg oper account block of flats 0.0714 Block 0 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 1
100004 0 Revolving loans M 1 1 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 1
100006 0 Cash loans F 0 1 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 3
100007 0 Cash loans M 0 1 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
456251 0 Cash loans M 0 0 0 157500.0 254700.0 27558.0 225000.0 Unaccompanied Working Secondary / secondary special Separated With parents 0.032561 -9327 -236 -8456.0 -1982 NaN 1 1 0 1 0 0 Sales staff 1.0 1 1 15 0 0 0 0 0 0 Services 0.145570 0.681632 NaN 0.2021 0.0887 0.9876 0.8300 0.0202 0.22 0.1034 0.6042 0.2708 0.0594 0.1484 0.1965 0.0753 0.1095 0.1008 0.0172 0.9782 0.7125 0.0172 0.0806 0.0345 0.4583 0.0417 0.0094 0.0882 0.0853 0.0 0.0125 0.2040 0.0887 0.9876 0.8323 0.0203 0.22 0.1034 0.6042 0.2708 0.0605 0.1509 0.2001 0.0757 0.1118 reg oper account block of flats 0.2898 Stone, brick 0 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 4
456252 0 Cash loans F 0 1 0 72000.0 269550.0 12001.5 225000.0 Unaccompanied Pensioner Secondary / secondary special Widow House / apartment 0.025164 -20775 365243 -4388.0 -4090 NaN 1 0 0 1 1 0 NaN 1.0 2 2 8 0 0 0 0 0 0 XNA NaN 0.115992 NaN 0.0247 0.0435 0.9727 0.6260 0.0022 0.00 0.1034 0.0833 0.1250 0.0579 0.0202 0.0257 0.0000 0.0000 0.0252 0.0451 0.9727 0.6406 0.0022 0.0000 0.1034 0.0833 0.1250 0.0592 0.0220 0.0267 0.0 0.0000 0.0250 0.0435 0.9727 0.6310 0.0022 0.00 0.1034 0.0833 0.1250 0.0589 0.0205 0.0261 0.0000 0.0000 reg oper account block of flats 0.0214 Stone, brick 0 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 1
456253 0 Cash loans F 0 1 0 153000.0 677664.0 29979.0 585000.0 Unaccompanied Working Higher education Separated House / apartment 0.005002 -14966 -7921 -6737.0 -5150 NaN 1 1 0 1 0 1 Managers 1.0 3 3 9 0 0 0 0 1 1 School 0.744026 0.535722 0.218859 0.1031 0.0862 0.9816 0.7484 0.0123 0.00 0.2069 0.1667 0.2083 NaN 0.0841 0.9279 0.0000 0.0000 0.1050 0.0894 0.9816 0.7583 0.0124 0.0000 0.2069 0.1667 0.2083 NaN 0.0918 0.9667 0.0 0.0000 0.1041 0.0862 0.9816 0.7518 0.0124 0.00 0.2069 0.1667 0.2083 NaN 0.0855 0.9445 0.0000 0.0000 reg oper account block of flats 0.7970 Panel 0 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0 4
456254 1 Cash loans F 0 1 0 171000.0 370107.0 20205.0 319500.0 Unaccompanied Commercial associate Secondary / secondary special Married House / apartment 0.005313 -11961 -4786 -2562.0 -931 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 9 0 0 0 1 1 0 Business Entity Type 1 NaN 0.514163 0.661024 0.0124 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0061 NaN NaN 0.0126 NaN 0.9772 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0063 NaN NaN 0.0125 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0062 NaN NaN NaN block of flats 0.0086 Stone, brick 0 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 3
456255 0 Cash loans F 0 0 0 157500.0 675000.0 49117.5 675000.0 Unaccompanied Commercial associate Higher education Married House / apartment 0.046220 -16856 -1262 -5128.0 -410 NaN 1 1 1 1 1 0 Laborers 2.0 1 1 20 0 0 0 0 1 1 Business Entity Type 3 0.734460 0.708569 0.113922 0.0742 0.0526 0.9881 NaN 0.0176 0.08 0.0690 0.3750 NaN NaN NaN 0.0791 NaN 0.0000 0.0756 0.0546 0.9881 NaN 0.0178 0.0806 0.0690 0.3750 NaN NaN NaN 0.0824 NaN 0.0000 0.0749 0.0526 0.9881 NaN 0.0177 0.08 0.0690 0.3750 NaN NaN NaN 0.0805 NaN 0.0000 NaN block of flats 0.0718 Panel 0 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0 4

307511 rows × 121 columns

Assigning the variable type (categorical, continuous and boolean)

As already seen in notebook 1, each variable is now assigned to a category. The variables are collected into one list per type, and those lists are then used to set the data type of each column.
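The helper functions in `funciones_auxiliares` are not shown in this notebook, so the following is only a minimal sketch of how such a classification could work, not the actual implementation. The function name `classify_variables_sketch`, the `max_categories` threshold and the rules themselves are assumptions made for illustration.

```python
import pandas as pd

def classify_variables_sketch(df, max_categories=5):
    """Hypothetical heuristic; NOT the real f_aux.clasificar_variables."""
    booleans, categoricals, continuous, unclassified = [], [], [], []
    for col in df.columns:
        values = df[col].dropna()
        if values.empty:
            # Nothing to inspect, so leave the column unclassified
            unclassified.append(col)
        elif pd.api.types.is_numeric_dtype(values) and set(values.unique()) <= {0, 1}:
            # Numeric flags that only take the values 0 and 1
            booleans.append(col)
        elif not pd.api.types.is_numeric_dtype(values):
            # Text columns are treated as categorical
            categoricals.append(col)
        elif pd.api.types.is_float_dtype(values):
            continuous.append(col)
        elif values.nunique() <= max_categories:
            # Integer codes with few distinct levels
            categoricals.append(col)
        else:
            # Integer columns with many levels need a manual decision
            unclassified.append(col)
    return booleans, categoricals, continuous, unclassified

# Toy data that borrows a few column names from the dataset
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 0],
    "CODE_GENDER": ["M", "F", "F", "M"],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5],
    "REGION_RATING_CLIENT": [2, 1, 2, 3],
    "DAYS_BIRTH": [-9461, -16765, -19046, -19005],
})
bools, cats, conts, rest = classify_variables_sketch(toy, max_categories=3)
```

Under these assumed rules, integer columns with many distinct values (such as `DAYS_BIRTH` in the toy data) stay unclassified and have to be assigned by hand, which is similar in spirit to the two-step classification done in the cells that follow.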

In [27]:
f_aux.clasificar_variables(df)
Variables Booleanas: 36 ['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
============================================================================================================================================================================
Variables Categóricas: 14 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE']
============================================================================================================================================================================
Variables Continuas: 65 ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
============================================================================================================================================================================
Variables no clasificadas: 6 ['CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START', 'NWEEKDAY_PROCESS_START']
In [28]:
f_aux.nueva_clasificar_variables(df)
Variables Booleanas: 36 ['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
============================================================================================================================================================================
Variables Categóricas: 16 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'CNT_CHILDREN', 'NWEEKDAY_PROCESS_START']
============================================================================================================================================================================
Variables Continuas: 69 ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']
=============================================================================================================================================================================
Variables no clasificadas: 0 []
In [29]:
lista_var_bool, lista_var_cat, lista_var_con, lista_var_no_clasificadas = f_aux.nueva_clasificar_variables(df)
Variables Booleanas: 36 ['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
============================================================================================================================================================================
Variables Categóricas: 16 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'CNT_CHILDREN', 'NWEEKDAY_PROCESS_START']
============================================================================================================================================================================
Variables Continuas: 69 ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']
=============================================================================================================================================================================
Variables no clasificadas: 0 []

Data type conversion

In [30]:
df[lista_var_cat] = df[lista_var_cat].astype("category")
# Coerce first so unparseable entries become NaN, then enforce float64
df[lista_var_con] = df[lista_var_con].apply(pd.to_numeric, errors='coerce').astype(float)
df['TARGET'] = df['TARGET'].astype(int)
df.dtypes
df.dtypes
Out[30]:
TARGET                             int64
NAME_CONTRACT_TYPE              category
CODE_GENDER                     category
FLAG_OWN_CAR                       int64
FLAG_OWN_REALTY                    int64
CNT_CHILDREN                    category
AMT_INCOME_TOTAL                 float64
AMT_CREDIT                       float64
AMT_ANNUITY                      float64
AMT_GOODS_PRICE                  float64
NAME_TYPE_SUITE                 category
NAME_INCOME_TYPE                category
NAME_EDUCATION_TYPE             category
NAME_FAMILY_STATUS              category
NAME_HOUSING_TYPE               category
REGION_POPULATION_RELATIVE       float64
DAYS_BIRTH                       float64
DAYS_EMPLOYED                    float64
DAYS_REGISTRATION                float64
DAYS_ID_PUBLISH                  float64
OWN_CAR_AGE                      float64
FLAG_MOBIL                         int64
FLAG_EMP_PHONE                     int64
FLAG_WORK_PHONE                    int64
FLAG_CONT_MOBILE                   int64
FLAG_PHONE                         int64
FLAG_EMAIL                         int64
OCCUPATION_TYPE                 category
CNT_FAM_MEMBERS                  float64
REGION_RATING_CLIENT            category
REGION_RATING_CLIENT_W_CITY     category
HOUR_APPR_PROCESS_START          float64
REG_REGION_NOT_LIVE_REGION         int64
REG_REGION_NOT_WORK_REGION         int64
LIVE_REGION_NOT_WORK_REGION        int64
REG_CITY_NOT_LIVE_CITY             int64
REG_CITY_NOT_WORK_CITY             int64
LIVE_CITY_NOT_WORK_CITY            int64
ORGANIZATION_TYPE               category
EXT_SOURCE_1                     float64
EXT_SOURCE_2                     float64
EXT_SOURCE_3                     float64
APARTMENTS_AVG                   float64
BASEMENTAREA_AVG                 float64
YEARS_BEGINEXPLUATATION_AVG      float64
YEARS_BUILD_AVG                  float64
COMMONAREA_AVG                   float64
ELEVATORS_AVG                    float64
ENTRANCES_AVG                    float64
FLOORSMAX_AVG                    float64
FLOORSMIN_AVG                    float64
LANDAREA_AVG                     float64
LIVINGAPARTMENTS_AVG             float64
LIVINGAREA_AVG                   float64
NONLIVINGAPARTMENTS_AVG          float64
NONLIVINGAREA_AVG                float64
APARTMENTS_MODE                  float64
BASEMENTAREA_MODE                float64
YEARS_BEGINEXPLUATATION_MODE     float64
YEARS_BUILD_MODE                 float64
COMMONAREA_MODE                  float64
ELEVATORS_MODE                   float64
ENTRANCES_MODE                   float64
FLOORSMAX_MODE                   float64
FLOORSMIN_MODE                   float64
LANDAREA_MODE                    float64
LIVINGAPARTMENTS_MODE            float64
LIVINGAREA_MODE                  float64
NONLIVINGAPARTMENTS_MODE         float64
NONLIVINGAREA_MODE               float64
APARTMENTS_MEDI                  float64
BASEMENTAREA_MEDI                float64
YEARS_BEGINEXPLUATATION_MEDI     float64
YEARS_BUILD_MEDI                 float64
COMMONAREA_MEDI                  float64
ELEVATORS_MEDI                   float64
ENTRANCES_MEDI                   float64
FLOORSMAX_MEDI                   float64
FLOORSMIN_MEDI                   float64
LANDAREA_MEDI                    float64
LIVINGAPARTMENTS_MEDI            float64
LIVINGAREA_MEDI                  float64
NONLIVINGAPARTMENTS_MEDI         float64
NONLIVINGAREA_MEDI               float64
FONDKAPREMONT_MODE              category
HOUSETYPE_MODE                  category
TOTALAREA_MODE                   float64
WALLSMATERIAL_MODE              category
EMERGENCYSTATE_MODE                int64
OBS_30_CNT_SOCIAL_CIRCLE         float64
DEF_30_CNT_SOCIAL_CIRCLE         float64
OBS_60_CNT_SOCIAL_CIRCLE         float64
DEF_60_CNT_SOCIAL_CIRCLE         float64
DAYS_LAST_PHONE_CHANGE           float64
FLAG_DOCUMENT_2                    int64
FLAG_DOCUMENT_3                    int64
FLAG_DOCUMENT_4                    int64
FLAG_DOCUMENT_5                    int64
FLAG_DOCUMENT_6                    int64
FLAG_DOCUMENT_7                    int64
FLAG_DOCUMENT_8                    int64
FLAG_DOCUMENT_9                    int64
FLAG_DOCUMENT_10                   int64
FLAG_DOCUMENT_11                   int64
FLAG_DOCUMENT_12                   int64
FLAG_DOCUMENT_13                   int64
FLAG_DOCUMENT_14                   int64
FLAG_DOCUMENT_15                   int64
FLAG_DOCUMENT_16                   int64
FLAG_DOCUMENT_17                   int64
FLAG_DOCUMENT_18                   int64
FLAG_DOCUMENT_19                   int64
FLAG_DOCUMENT_20                   int64
FLAG_DOCUMENT_21                   int64
AMT_REQ_CREDIT_BUREAU_HOUR       float64
AMT_REQ_CREDIT_BUREAU_DAY        float64
AMT_REQ_CREDIT_BUREAU_WEEK       float64
AMT_REQ_CREDIT_BUREAU_MON        float64
AMT_REQ_CREDIT_BUREAU_QRT        float64
AMT_REQ_CREDIT_BUREAU_YEAR       float64
NWEEKDAY_PROCESS_START          category
dtype: object
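As a small illustration of the conversion step, the sketch below applies the same pandas calls to a toy frame. The toy values, including the unparseable `"n/a"` entry, are made up for the example and do not come from the dataset.

```python
import pandas as pd

# Toy frame: one numeric column stored as text, one integer-coded category
toy = pd.DataFrame({
    "AMT_CREDIT": ["406597.5", "135000", "n/a"],
    "REGION_RATING_CLIENT": [2, 1, 2],
})

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising
toy["AMT_CREDIT"] = pd.to_numeric(toy["AMT_CREDIT"], errors="coerce").astype(float)

# Integer codes become a categorical column with a fixed set of levels
toy["REGION_RATING_CLIENT"] = toy["REGION_RATING_CLIENT"].astype("category")

# Number of values that were replaced by NaN during coercion
n_coerced = int(toy["AMT_CREDIT"].isna().sum())
print(toy.dtypes)
```

Calling `astype(float)` directly on the text column would raise on `"n/a"`, which is why coercion is done with `pd.to_numeric` here.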

Stratified train/test split

The purpose of this step is to keep the class proportions of TARGET the same in the training set and the test set. Both sets are then representative of the full data, which allows a more reliable evaluation of the model.

In [31]:
X = df.drop('TARGET', axis=1)  # Drop the 'TARGET' column from the feature set
y = df['TARGET']               # Keep the 'TARGET' column as the target variable
In [32]:
X_pd_loan, X_pd_loan_test, y_pd_loan, y_pd_loan_test = train_test_split(X, y, 
                                                                     stratify=df['TARGET'], 
                                                                     test_size=0.2, random_state = semilla)
df_train = pd.concat([X_pd_loan, y_pd_loan],axis=1)
df_test = pd.concat([X_pd_loan_test, y_pd_loan_test],axis=1)

print('== Train\n', df_train['TARGET'].value_counts(normalize=True))
print('== Test\n', df_test['TARGET'].value_counts(normalize=True))
== Train
 TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64
== Test
 TARGET
0    0.919272
1    0.080728
Name: proportion, dtype: float64

This split uses the seed defined at the start of the notebook (semilla = 42) to guarantee reproducibility: the same partition into training and test sets is obtained every time the code is run, so the results can be replicated in later executions.
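
As a minimal sketch of the two properties described above (reproducibility from the seed and matching class proportions from stratification), the following uses a small synthetic target rather than the notebook's data. The `split` helper, the `feature` column, and the 8% positive rate are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target (about 8% positives), loosely mimicking TARGET.
rng = np.random.default_rng(0)
y = pd.Series((rng.random(10_000) < 0.08).astype(int), name="TARGET")
X = pd.DataFrame({"feature": rng.normal(size=len(y))})


def split(seed):
    """Stratified 80/20 split with a fixed seed (hypothetical helper)."""
    return train_test_split(X, y, stratify=y, test_size=0.2, random_state=seed)


X_tr_a, X_te_a, y_tr_a, y_te_a = split(42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(42)

# Same seed: both calls should return exactly the same partition.
same_partition = X_tr_a.index.equals(X_tr_b.index) and X_te_a.index.equals(X_te_b.index)

# Stratification: the positive rate should be almost identical in both subsets.
rate_gap = abs(y_tr_a.mean() - y_te_a.mean())
print(same_partition, round(rate_gap, 4))
```

Changing the seed would be expected to change which rows land in each subset, while stratification keeps the class proportions close in either case.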

Descriptive overview of the data

The functions nulos_columna( ) and nulos_filas( ) let us assess the consistency of the data by counting the null values per variable and per row, respectively. This helps evaluate which variables could contribute most to the model and which may have limited impact because of their high share of null values.

In [33]:
f_aux.nulos_columna(df)
Out[33]:
nulos_columnas porcentaje_columnas
COMMONAREA_MODE 214865 69.872297
COMMONAREA_MEDI 214865 69.872297
COMMONAREA_AVG 214865 69.872297
NONLIVINGAPARTMENTS_MEDI 213514 69.432963
NONLIVINGAPARTMENTS_MODE 213514 69.432963
NONLIVINGAPARTMENTS_AVG 213514 69.432963
FONDKAPREMONT_MODE 210295 68.386172
LIVINGAPARTMENTS_AVG 210199 68.354953
LIVINGAPARTMENTS_MEDI 210199 68.354953
LIVINGAPARTMENTS_MODE 210199 68.354953
FLOORSMIN_MODE 208642 67.848630
FLOORSMIN_MEDI 208642 67.848630
FLOORSMIN_AVG 208642 67.848630
YEARS_BUILD_AVG 204488 66.497784
YEARS_BUILD_MEDI 204488 66.497784
YEARS_BUILD_MODE 204488 66.497784
OWN_CAR_AGE 202929 65.990810
LANDAREA_MEDI 182590 59.376738
LANDAREA_AVG 182590 59.376738
LANDAREA_MODE 182590 59.376738
BASEMENTAREA_MEDI 179943 58.515956
BASEMENTAREA_MODE 179943 58.515956
BASEMENTAREA_AVG 179943 58.515956
EXT_SOURCE_1 173378 56.381073
NONLIVINGAREA_AVG 169682 55.179164
NONLIVINGAREA_MEDI 169682 55.179164
NONLIVINGAREA_MODE 169682 55.179164
ELEVATORS_AVG 163891 53.295980
ELEVATORS_MEDI 163891 53.295980
ELEVATORS_MODE 163891 53.295980
WALLSMATERIAL_MODE 156341 50.840783
APARTMENTS_AVG 156061 50.749729
APARTMENTS_MODE 156061 50.749729
APARTMENTS_MEDI 156061 50.749729
ENTRANCES_MODE 154828 50.348768
ENTRANCES_AVG 154828 50.348768
ENTRANCES_MEDI 154828 50.348768
LIVINGAREA_MODE 154350 50.193326
LIVINGAREA_AVG 154350 50.193326
LIVINGAREA_MEDI 154350 50.193326
HOUSETYPE_MODE 154297 50.176091
FLOORSMAX_MEDI 153020 49.760822
FLOORSMAX_AVG 153020 49.760822
FLOORSMAX_MODE 153020 49.760822
YEARS_BEGINEXPLUATATION_MEDI 150007 48.781019
YEARS_BEGINEXPLUATATION_MODE 150007 48.781019
YEARS_BEGINEXPLUATATION_AVG 150007 48.781019
TOTALAREA_MODE 148431 48.268517
OCCUPATION_TYPE 96391 31.345545
EXT_SOURCE_3 60965 19.825307
AMT_REQ_CREDIT_BUREAU_WEEK 41519 13.501631
AMT_REQ_CREDIT_BUREAU_YEAR 41519 13.501631
AMT_REQ_CREDIT_BUREAU_QRT 41519 13.501631
AMT_REQ_CREDIT_BUREAU_HOUR 41519 13.501631
AMT_REQ_CREDIT_BUREAU_MON 41519 13.501631
AMT_REQ_CREDIT_BUREAU_DAY 41519 13.501631
NAME_TYPE_SUITE 1292 0.420148
DEF_30_CNT_SOCIAL_CIRCLE 1021 0.332021
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.332021
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.332021
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.332021
EXT_SOURCE_2 660 0.214626
AMT_GOODS_PRICE 278 0.090403
AMT_ANNUITY 12 0.003902
CNT_FAM_MEMBERS 2 0.000650
DAYS_LAST_PHONE_CHANGE 1 0.000325
FLAG_DOCUMENT_5 0 0.000000
FLAG_DOCUMENT_6 0 0.000000
FLAG_DOCUMENT_7 0 0.000000
FLAG_DOCUMENT_8 0 0.000000
FLAG_DOCUMENT_4 0 0.000000
FLAG_DOCUMENT_12 0 0.000000
FLAG_DOCUMENT_3 0 0.000000
FLAG_DOCUMENT_2 0 0.000000
FLAG_DOCUMENT_11 0 0.000000
FLAG_DOCUMENT_21 0 0.000000
FLAG_DOCUMENT_20 0 0.000000
FLAG_DOCUMENT_19 0 0.000000
EMERGENCYSTATE_MODE 0 0.000000
FLAG_DOCUMENT_18 0 0.000000
FLAG_DOCUMENT_17 0 0.000000
FLAG_DOCUMENT_9 0 0.000000
FLAG_DOCUMENT_16 0 0.000000
FLAG_DOCUMENT_15 0 0.000000
FLAG_DOCUMENT_14 0 0.000000
FLAG_DOCUMENT_13 0 0.000000
FLAG_DOCUMENT_10 0 0.000000
TARGET 0 0.000000
NAME_CONTRACT_TYPE 0 0.000000
DAYS_ID_PUBLISH 0 0.000000
CODE_GENDER 0 0.000000
FLAG_OWN_CAR 0 0.000000
FLAG_OWN_REALTY 0 0.000000
CNT_CHILDREN 0 0.000000
AMT_INCOME_TOTAL 0 0.000000
AMT_CREDIT 0 0.000000
NAME_INCOME_TYPE 0 0.000000
NAME_EDUCATION_TYPE 0 0.000000
NAME_FAMILY_STATUS 0 0.000000
NAME_HOUSING_TYPE 0 0.000000
REGION_POPULATION_RELATIVE 0 0.000000
DAYS_BIRTH 0 0.000000
DAYS_EMPLOYED 0 0.000000
DAYS_REGISTRATION 0 0.000000
FLAG_MOBIL 0 0.000000
ORGANIZATION_TYPE 0 0.000000
FLAG_EMP_PHONE 0 0.000000
FLAG_WORK_PHONE 0 0.000000
FLAG_CONT_MOBILE 0 0.000000
FLAG_PHONE 0 0.000000
FLAG_EMAIL 0 0.000000
REGION_RATING_CLIENT 0 0.000000
REGION_RATING_CLIENT_W_CITY 0 0.000000
HOUR_APPR_PROCESS_START 0 0.000000
REG_REGION_NOT_LIVE_REGION 0 0.000000
REG_REGION_NOT_WORK_REGION 0 0.000000
LIVE_REGION_NOT_WORK_REGION 0 0.000000
REG_CITY_NOT_LIVE_CITY 0 0.000000
REG_CITY_NOT_WORK_CITY 0 0.000000
LIVE_CITY_NOT_WORK_CITY 0 0.000000
NWEEKDAY_PROCESS_START 0 0.000000
In [34]:
f_aux.nulos_filas(df)
Out[34]:
nulos_filas porcentaje_filas
SK_ID_CURR
235599 60 0.495868
412671 60 0.495868
315294 60 0.495868
255145 60 0.495868
412312 60 0.495868
... ... ...
250717 0 0.000000
250702 0 0.000000
250697 0 0.000000
250680 0 0.000000
278202 0 0.000000

307511 rows × 2 columns
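
The source of nulos_columna( ) and nulos_filas( ) is not shown in this notebook, so the following is only a sketch of what such helpers might compute; the function names `null_counts_by_column` and `null_counts_by_row` are hypothetical. It assumes the scales suggested by the outputs above: the column table reports a percentage (214865 nulls out of 307511 rows is about 69.87), while the row table appears to report a fraction (60 nulls out of 121 columns is about 0.4959).

```python
import numpy as np
import pandas as pd


def null_counts_by_column(df):
    """Hypothetical stand-in for nulos_columna: nulls per column, as count and percent."""
    counts = df.isna().sum()
    out = pd.DataFrame({
        "nulos_columnas": counts,
        "porcentaje_columnas": 100 * counts / len(df),
    })
    return out.sort_values("nulos_columnas", ascending=False)


def null_counts_by_row(df):
    """Hypothetical stand-in for nulos_filas: nulls per row, as count and fraction."""
    counts = df.isna().sum(axis=1)
    out = pd.DataFrame({
        "nulos_filas": counts,
        "porcentaje_filas": counts / df.shape[1],
    })
    return out.sort_values("nulos_filas", ascending=False)


# Tiny illustrative frame, not the notebook's data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})
by_col = null_counts_by_column(toy)
by_row = null_counts_by_row(toy)
print(by_col)
print(by_row)
```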

Distribution plots of the variables

The following code loops over the columns and calls plot_feature( ) according to the type of each variable. If the variable is continuous, a histogram and a boxplot against the target variable are generated. If the variable is categorical or boolean, two bar charts are shown: one for the overall distribution of the variable and one against the target variable.

In [35]:
warnings.filterwarnings('ignore')
for i in list(df_train.columns):
    if (df_train[i].dtype==float) & (i!='TARGET'):
        print('Graficos de la variable: ' + i)
        f_aux.plot_feature(df_train, col_name=i, isContinuous=True, target='TARGET')
    elif  i!='TARGET':
        print('Graficos de la variable: ' + i)
        f_aux.plot_feature(df_train, col_name=i, isContinuous=False, target='TARGET')
[Output: plots for 120 variables, each printed under the label "Graficos de la variable: <column>", following the column order of df_train from NAME_CONTRACT_TYPE to NWEEKDAY_PROCESS_START. The images are not available in this export.]
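
The loop above decides between the continuous and the categorical/boolean plots by checking whether a column has a float dtype. A minimal sketch of that dispatch rule, isolated from the plotting, is shown below. The helper name `split_columns_by_kind` and the toy frame are assumptions for illustration, and `pandas.api.types.is_float_dtype` is used here in place of the `dtype == float` comparison.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def split_columns_by_kind(df, target="TARGET"):
    """Return (continuous, other) column names, using float dtype as the rule."""
    continuous, other = [], []
    for col in df.columns:
        if col == target:
            continue  # the target itself is never plotted as a feature
        (continuous if is_float_dtype(df[col]) else other).append(col)
    return continuous, other


# Toy frame with one float, one integer and one categorical column.
toy = pd.DataFrame({
    "TARGET": [0, 1, 0],
    "AMT_CREDIT": [1000.0, 2500.5, 800.0],
    "CNT_CHILDREN": [0, 2, 1],
    "CODE_GENDER": pd.Categorical(["F", "M", "F"]),
})
continuous, other = split_columns_by_kind(toy)
print(continuous, other)
```

Under this rule, integer columns such as CNT_CHILDREN fall into the non-continuous group, which is also how the loop above treats any int64 column.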

Conclusions from the plots

These plots, one set for each of the 120 predictor variables, show how each variable varies on its own and in relation to the target variable. It is useful to move from the particular to the general. At the individual level, we look first at gender, where men show a higher loan repayment rate than women. For education, a higher level of education goes with a greater tendency to pay off the loan (associated with TARGET = 0). For age, older applicants are more likely to repay the loan, which is reflected in the clear differences between the interquartile ranges in the boxplot.

More generally, the type of occupation and the organization where the applicant works stand out. People who work in formal, well-established environments, such as large companies, are more likely to repay the loan on time and in full. By contrast, those in trades or less specialized jobs, such as low-skill laborers, waiting staff, and drivers, tend to have a lower rate of on-time payment.

Finally, some variables appear decisive for the model, such as income, the external references from other banks, the situation of the applicant's close social circle, and the place of residence. These factors indicate a person's economic capacity and stability, which directly affect their ability to meet loan payments.

Treatment of the continuous variables

This section covers the missing values, the correlations, and the outliers of the continuous variables.

In [36]:
lista_var_con
Out[36]:
['AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_REGISTRATION',
 'OWN_CAR_AGE',
 'CNT_FAM_MEMBERS',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMAX_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMAX_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'TOTALAREA_MODE',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_ID_PUBLISH',
 'HOUR_APPR_PROCESS_START']

The function get_deviation_of_mean_perc( ) determines what proportion of each continuous variable lies outside an interval centered on the mean, whose half-width is the standard deviation multiplied by the factor multiplier (here 3). For each variable, it returns the number and share of values outside that range, together with how those extreme values are distributed across the classes of the target variable.

In [54]:
f_aux.get_deviation_of_mean_perc(df_train, lista_var_con, target = 'TARGET', multiplier = 3)
Out[54]:
variable 0 1 sum_outlier_values porcentaje_sum_null_values
0 AMT_INCOME_TOTAL 0.947115 0.052885 208 0.000846
1 AMT_CREDIT 0.958763 0.041237 2619 0.010646
2 AMT_ANNUITY 0.963606 0.036394 2363 0.009605
3 AMT_GOODS_PRICE 0.962963 0.037037 3321 0.013500
4 REGION_POPULATION_RELATIVE 0.960321 0.039679 6729 0.027353
5 DAYS_REGISTRATION 0.957586 0.042414 613 0.002492
6 OWN_CAR_AGE 0.915541 0.084459 2664 0.010829
7 CNT_FAM_MEMBERS 0.902377 0.097623 3155 0.012825
8 APARTMENTS_AVG 0.949831 0.050169 2372 0.009642
9 BASEMENTAREA_AVG 0.948604 0.051396 1576 0.006406
10 YEARS_BEGINEXPLUATATION_AVG 0.906526 0.093474 567 0.002305
11 YEARS_BUILD_AVG 0.927597 0.072403 953 0.003874
12 COMMONAREA_AVG 0.941691 0.058309 1372 0.005577
13 ELEVATORS_AVG 0.955647 0.044353 1939 0.007882
14 ENTRANCES_AVG 0.939684 0.060316 1774 0.007211
15 FLOORSMAX_AVG 0.957046 0.042954 2072 0.008422
16 FLOORSMIN_AVG 0.960870 0.039130 460 0.001870
17 LANDAREA_AVG 0.933374 0.066626 1651 0.006711
18 LIVINGAPARTMENTS_AVG 0.948958 0.051042 1391 0.005654
19 LIVINGAREA_AVG 0.948134 0.051866 2545 0.010345
20 NONLIVINGAPARTMENTS_AVG 0.929174 0.070826 593 0.002410
21 NONLIVINGAREA_AVG 0.946875 0.053125 1920 0.007805
22 APARTMENTS_MODE 0.950021 0.049979 2401 0.009760
23 BASEMENTAREA_MODE 0.946789 0.053211 1635 0.006646
24 YEARS_BEGINEXPLUATATION_MODE 0.904676 0.095324 556 0.002260
25 YEARS_BUILD_MODE 0.928423 0.071577 964 0.003919
26 COMMONAREA_MODE 0.938462 0.061538 1365 0.005549
27 ELEVATORS_MODE 0.952078 0.047922 2671 0.010857
28 ENTRANCES_MODE 0.938601 0.061399 1759 0.007150
29 FLOORSMAX_MODE 0.958591 0.041409 2101 0.008540
30 FLOORSMIN_MODE 0.963061 0.036939 379 0.001541
31 LANDAREA_MODE 0.932749 0.067251 1710 0.006951
32 LIVINGAPARTMENTS_MODE 0.946191 0.053809 1431 0.005817
33 LIVINGAREA_MODE 0.948134 0.051866 2680 0.010894
34 NONLIVINGAPARTMENTS_MODE 0.921429 0.078571 560 0.002276
35 NONLIVINGAREA_MODE 0.947773 0.052227 1953 0.007939
36 APARTMENTS_MEDI 0.949938 0.050062 2417 0.009825
37 BASEMENTAREA_MEDI 0.949057 0.050943 1590 0.006463
38 YEARS_BEGINEXPLUATATION_MEDI 0.902985 0.097015 536 0.002179
39 YEARS_BUILD_MEDI 0.928200 0.071800 961 0.003906
40 COMMONAREA_MEDI 0.940374 0.059626 1392 0.005658
41 ELEVATORS_MEDI 0.954969 0.045031 1932 0.007853
42 ENTRANCES_MEDI 0.938833 0.061167 1782 0.007244
43 FLOORSMAX_MEDI 0.956861 0.043139 2179 0.008857
44 FLOORSMIN_MEDI 0.960648 0.039352 432 0.001756
45 LANDAREA_MEDI 0.935807 0.064193 1698 0.006902
46 LIVINGAPARTMENTS_MEDI 0.947745 0.052255 1397 0.005679
47 LIVINGAREA_MEDI 0.949495 0.050505 2574 0.010463
48 NONLIVINGAPARTMENTS_MEDI 0.926995 0.073005 589 0.002394
49 NONLIVINGAREA_MEDI 0.947152 0.052848 1949 0.007923
50 TOTALAREA_MODE 0.956032 0.043968 2661 0.010817
51 OBS_30_CNT_SOCIAL_CIRCLE 0.907786 0.092214 4945 0.020101
52 DEF_30_CNT_SOCIAL_CIRCLE 0.881830 0.118170 5509 0.022394
53 OBS_60_CNT_SOCIAL_CIRCLE 0.907311 0.092689 4801 0.019516
54 DEF_60_CNT_SOCIAL_CIRCLE 0.872681 0.127319 3126 0.012707
55 DAYS_LAST_PHONE_CHANGE 0.961847 0.038153 498 0.002024
56 AMT_REQ_CREDIT_BUREAU_HOUR 0.918750 0.081250 1280 0.005203
57 AMT_REQ_CREDIT_BUREAU_DAY 0.902813 0.097187 1173 0.004768
58 AMT_REQ_CREDIT_BUREAU_WEEK 0.921424 0.078576 6796 0.027625
59 AMT_REQ_CREDIT_BUREAU_MON 0.947531 0.052469 2592 0.010536
60 AMT_REQ_CREDIT_BUREAU_QRT 0.916026 0.083974 1822 0.007406
61 AMT_REQ_CREDIT_BUREAU_YEAR 0.909022 0.090978 2649 0.010768
62 HOUR_APPR_PROCESS_START 0.898167 0.101833 491 0.001996
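
The implementation of get_deviation_of_mean_perc( ) is not shown here, so the following is only a sketch of the kind of computation described above: flag values outside mean ± multiplier × standard deviation, count them, and report how they split across the target classes. The function name `outlier_share_by_target`, the output column names `n_outliers` and `share_of_rows`, and the choice to omit columns without outliers are all assumptions.

```python
import numpy as np
import pandas as pd


def outlier_share_by_target(df, columns, target="TARGET", multiplier=3):
    """Hypothetical sketch: per column, count values outside
    mean +/- multiplier * std and show their split across target classes."""
    rows = []
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        lower, upper = mean - multiplier * std, mean + multiplier * std
        # NaN values compare as False, so they are never flagged.
        mask = (df[col] < lower) | (df[col] > upper)
        n_out = int(mask.sum())
        if n_out == 0:
            continue  # assumption: columns without outliers are left out
        shares = df.loc[mask, target].value_counts(normalize=True)
        rows.append({
            "variable": col,
            0: shares.get(0, 0.0),
            1: shares.get(1, 0.0),
            "n_outliers": n_out,
            "share_of_rows": n_out / len(df),
        })
    return pd.DataFrame(rows)


# Toy data: "x" has two extreme values, "y" has none.
toy = pd.DataFrame({
    "x": [0.0] * 98 + [100.0, -100.0],
    "y": np.arange(100, dtype=float),
    "TARGET": [0] * 99 + [1],
})
summary = outlier_share_by_target(toy, ["x", "y"])
print(summary)
```

In the notebook's output, the last column (porcentaje_sum_null_values) appears to play the role of `share_of_rows` here: for AMT_INCOME_TOTAL, 208 flagged values over roughly 246,000 training rows gives about 0.000846.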

Conclusions on the impact of the continuous variables with respect to the target variable

When a variable has more values outside the mean ± 3 standard deviations interval, its data are more dispersed. Such variables may be more relevant for the bank's risk assessment, since they relate to more diverse applicant profiles. For example, CNT_FAM_MEMBERS has 3,155 values outside the interval, which indicates greater heterogeneity in family size and is relevant when assessing risk in relation to loan repayment.

By contrast, a variable with few values outside the interval, such as AMT_INCOME_TOTAL with only 208 outliers, suggests that applicants have similar incomes, that is, a more homogeneous profile with respect to this variable. This analysis makes it possible to identify key variables for building general applicant profiles.

Correlation plot

In [38]:
f_aux.get_corr_matrix(dataset = df_train[lista_var_con], metodo = 'pearson', size_figure = [10,8])
[Figure: Pearson correlation matrix heatmap of the continuous variables; image not available in this export.]
Out[38]:
0
In [39]:
corr = df_train[lista_var_con].corr('pearson')
new_corr = corr.abs()
new_corr.loc[:,:] = np.tril(new_corr, k=-1) # keep the strictly lower triangle; zero out the diagonal and the mirrored upper half
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
new_corr[new_corr['correlation']> 0.6]
Out[39]:
level_0 level_1 correlation
3918 OBS_60_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE 0.998514
2912 YEARS_BUILD_MEDI YEARS_BUILD_AVG 0.998391
3262 FLOORSMIN_MEDI FLOORSMIN_AVG 0.997322
3192 FLOORSMAX_MEDI FLOORSMAX_AVG 0.996983
3122 ENTRANCES_MEDI ENTRANCES_AVG 0.996911
3052 ELEVATORS_MEDI ELEVATORS_AVG 0.996319
2982 COMMONAREA_MEDI COMMONAREA_AVG 0.995660
3472 LIVINGAREA_MEDI LIVINGAREA_AVG 0.995472
2702 APARTMENTS_MEDI APARTMENTS_AVG 0.995430
2772 BASEMENTAREA_MEDI BASEMENTAREA_AVG 0.994335
2842 YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_AVG 0.994314
3402 LIVINGAPARTMENTS_MEDI LIVINGAPARTMENTS_AVG 0.993621
3612 NONLIVINGAREA_MEDI NONLIVINGAREA_AVG 0.991197
3332 LANDAREA_MEDI LANDAREA_AVG 0.991056
1946 YEARS_BUILD_MODE YEARS_BUILD_AVG 0.989372
2926 YEARS_BUILD_MEDI YEARS_BUILD_MODE 0.989272
3542 NONLIVINGAPARTMENTS_MEDI NONLIVINGAPARTMENTS_AVG 0.989047
3276 FLOORSMIN_MEDI FLOORSMIN_MODE 0.988735
3206 FLOORSMAX_MEDI FLOORSMAX_MODE 0.988205
208 AMT_GOODS_PRICE AMT_CREDIT 0.987000
2296 FLOORSMIN_MODE FLOORSMIN_AVG 0.986250
2226 FLOORSMAX_MODE FLOORSMAX_AVG 0.985561
3066 ELEVATORS_MEDI ELEVATORS_MODE 0.982819
3346 LANDAREA_MEDI LANDAREA_MODE 0.981517
3556 NONLIVINGAPARTMENTS_MEDI NONLIVINGAPARTMENTS_MODE 0.981259
3136 ENTRANCES_MEDI ENTRANCES_MODE 0.981012
2086 ELEVATORS_MODE ELEVATORS_AVG 0.979161
2996 COMMONAREA_MEDI COMMONAREA_MODE 0.978934
2156 ENTRANCES_MODE ENTRANCES_AVG 0.978034
2786 BASEMENTAREA_MEDI BASEMENTAREA_MODE 0.977787
2716 APARTMENTS_MEDI APARTMENTS_MODE 0.977514
3626 NONLIVINGAREA_MEDI NONLIVINGAREA_MODE 0.976066
2016 COMMONAREA_MODE COMMONAREA_AVG 0.975988
3486 LIVINGAREA_MEDI LIVINGAREA_MODE 0.975391
3416 LIVINGAPARTMENTS_MEDI LIVINGAPARTMENTS_MODE 0.975138
1736 APARTMENTS_MODE APARTMENTS_AVG 0.974062
1806 BASEMENTAREA_MODE BASEMENTAREA_AVG 0.973389
1876 YEARS_BEGINEXPLUATATION_MODE YEARS_BEGINEXPLUATATION_AVG 0.973181
2366 LANDAREA_MODE LANDAREA_AVG 0.973105
2506 LIVINGAREA_MODE LIVINGAREA_AVG 0.972434
2576 NONLIVINGAPARTMENTS_MODE NONLIVINGAPARTMENTS_AVG 0.970068
2436 LIVINGAPARTMENTS_MODE LIVINGAPARTMENTS_AVG 0.969449
2646 NONLIVINGAREA_MODE NONLIVINGAREA_AVG 0.967162
2856 YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE 0.966567
1460 LIVINGAPARTMENTS_AVG APARTMENTS_AVG 0.945033
3420 LIVINGAPARTMENTS_MEDI APARTMENTS_MEDI 0.943933
3392 LIVINGAPARTMENTS_MEDI APARTMENTS_AVG 0.943237
2440 LIVINGAPARTMENTS_MODE APARTMENTS_MODE 0.940871
2712 APARTMENTS_MEDI LIVINGAPARTMENTS_AVG 0.936922
2726 APARTMENTS_MEDI LIVINGAPARTMENTS_MODE 0.934145
2426 LIVINGAPARTMENTS_MODE APARTMENTS_AVG 0.933212
3679 TOTALAREA_MODE LIVINGAREA_AVG 0.925936
3707 TOTALAREA_MODE LIVINGAREA_MEDI 0.920434
3406 LIVINGAPARTMENTS_MEDI APARTMENTS_MODE 0.916930
3489 LIVINGAREA_MEDI APARTMENTS_MEDI 0.916000
1529 LIVINGAREA_AVG APARTMENTS_AVG 0.913769
3461 LIVINGAREA_MEDI APARTMENTS_AVG 0.912728
2713 APARTMENTS_MEDI LIVINGAREA_AVG 0.912434
2509 LIVINGAREA_MODE APARTMENTS_MODE 0.910780
1746 APARTMENTS_MODE LIVINGAPARTMENTS_AVG 0.910617
3693 TOTALAREA_MODE LIVINGAREA_MODE 0.900830
2727 APARTMENTS_MEDI LIVINGAREA_MODE 0.897113
2495 LIVINGAREA_MODE APARTMENTS_AVG 0.894706
3668 TOTALAREA_MODE APARTMENTS_AVG 0.894517
3475 LIVINGAREA_MEDI APARTMENTS_MODE 0.894490
1747 APARTMENTS_MODE LIVINGAREA_AVG 0.891087
3696 TOTALAREA_MODE APARTMENTS_MEDI 0.888248
3499 LIVINGAREA_MEDI LIVINGAPARTMENTS_MEDI 0.885952
3403 LIVINGAPARTMENTS_MEDI LIVINGAREA_AVG 0.884223
2519 LIVINGAREA_MODE LIVINGAPARTMENTS_MODE 0.881550
1539 LIVINGAREA_AVG LIVINGAPARTMENTS_AVG 0.881535
3471 LIVINGAREA_MEDI LIVINGAPARTMENTS_AVG 0.879412
3485 LIVINGAREA_MEDI LIVINGAPARTMENTS_MODE 0.876144
2437 LIVINGAPARTMENTS_MODE LIVINGAREA_AVG 0.874517
3494 LIVINGAREA_MEDI ELEVATORS_MEDI 0.868383
1534 LIVINGAREA_AVG ELEVATORS_AVG 0.867590
3466 LIVINGAREA_MEDI ELEVATORS_AVG 0.865807
3682 TOTALAREA_MODE APARTMENTS_MODE 0.865652
3058 ELEVATORS_MEDI LIVINGAREA_AVG 0.865594
3417 LIVINGAPARTMENTS_MEDI LIVINGAREA_MODE 0.860090
3988 DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.859088
2514 LIVINGAREA_MODE ELEVATORS_MODE 0.856213
3480 LIVINGAREA_MEDI ELEVATORS_MODE 0.856028
2505 LIVINGAREA_MODE LIVINGAPARTMENTS_AVG 0.854263
2092 ELEVATORS_MODE LIVINGAREA_AVG 0.852650
3678 TOTALAREA_MODE LIVINGAPARTMENTS_AVG 0.851677
3706 TOTALAREA_MODE LIVINGAPARTMENTS_MEDI 0.849999
3673 TOTALAREA_MODE ELEVATORS_AVG 0.846518
3072 ELEVATORS_MEDI LIVINGAREA_MODE 0.841594
3701 TOTALAREA_MODE ELEVATORS_MEDI 0.840034
2500 LIVINGAREA_MODE ELEVATORS_AVG 0.839674
3692 TOTALAREA_MODE LIVINGAPARTMENTS_MODE 0.838990
3075 ELEVATORS_MEDI APARTMENTS_MEDI 0.838494
1115 ELEVATORS_AVG APARTMENTS_AVG 0.838347
3047 ELEVATORS_MEDI APARTMENTS_AVG 0.836446
2707 APARTMENTS_MEDI ELEVATORS_AVG 0.835813
2095 ELEVATORS_MODE APARTMENTS_MODE 0.827446
2721 APARTMENTS_MEDI ELEVATORS_MODE 0.826966
2081 ELEVATORS_MODE APARTMENTS_AVG 0.824274
3687 TOTALAREA_MODE ELEVATORS_MODE 0.822899
3425 LIVINGAPARTMENTS_MEDI ELEVATORS_MEDI 0.815880
3397 LIVINGAPARTMENTS_MEDI ELEVATORS_AVG 0.814409
1465 LIVINGAPARTMENTS_AVG ELEVATORS_AVG 0.813318
3057 ELEVATORS_MEDI LIVINGAPARTMENTS_AVG 0.810824
2445 LIVINGAPARTMENTS_MODE ELEVATORS_MODE 0.810610
3061 ELEVATORS_MEDI APARTMENTS_MODE 0.810200
1741 APARTMENTS_MODE ELEVATORS_AVG 0.807742
3411 LIVINGAPARTMENTS_MEDI ELEVATORS_MODE 0.802015
3071 ELEVATORS_MEDI LIVINGAPARTMENTS_MODE 0.800567
2431 LIVINGAPARTMENTS_MODE ELEVATORS_AVG 0.799062
2091 ELEVATORS_MODE LIVINGAPARTMENTS_AVG 0.796596
209 AMT_GOODS_PRICE AMT_ANNUITY 0.775310
139 AMT_ANNUITY AMT_CREDIT 0.770163
1329 FLOORSMIN_AVG FLOORSMAX_AVG 0.739772
3289 FLOORSMIN_MEDI FLOORSMAX_MEDI 0.737659
3261 FLOORSMIN_MEDI FLOORSMAX_AVG 0.737192
3193 FLOORSMAX_MEDI FLOORSMIN_AVG 0.737175
3275 FLOORSMIN_MEDI FLOORSMAX_MODE 0.727059
2227 FLOORSMAX_MODE FLOORSMIN_AVG 0.726317
2309 FLOORSMIN_MODE FLOORSMAX_MODE 0.723685
3207 FLOORSMAX_MEDI FLOORSMIN_MODE 0.720675
2295 FLOORSMIN_MODE FLOORSMAX_AVG 0.720125
1530 LIVINGAREA_AVG BASEMENTAREA_AVG 0.692715
2782 BASEMENTAREA_MEDI LIVINGAREA_AVG 0.692455
3490 LIVINGAREA_MEDI BASEMENTAREA_MEDI 0.691733
2510 LIVINGAREA_MODE BASEMENTAREA_MODE 0.690915
3462 LIVINGAREA_MEDI BASEMENTAREA_AVG 0.689655
2796 BASEMENTAREA_MEDI LIVINGAREA_MODE 0.680955
2799 BASEMENTAREA_MEDI APARTMENTS_MEDI 0.680276
1258 FLOORSMAX_AVG ELEVATORS_AVG 0.680109
839 BASEMENTAREA_AVG APARTMENTS_AVG 0.679130
2771 BASEMENTAREA_MEDI APARTMENTS_AVG 0.678835
1819 BASEMENTAREA_MODE APARTMENTS_MODE 0.678681
2703 APARTMENTS_MEDI BASEMENTAREA_AVG 0.677920
3190 FLOORSMAX_MEDI ELEVATORS_AVG 0.677855
2496 LIVINGAREA_MODE BASEMENTAREA_AVG 0.677536
3054 ELEVATORS_MEDI FLOORSMAX_AVG 0.676167
3218 FLOORSMAX_MEDI ELEVATORS_MEDI 0.675603
1816 BASEMENTAREA_MODE LIVINGAREA_AVG 0.674590
3476 LIVINGAREA_MEDI BASEMENTAREA_MODE 0.674313
3669 TOTALAREA_MODE BASEMENTAREA_AVG 0.672990
2224 FLOORSMAX_MODE ELEVATORS_AVG 0.671161
3697 TOTALAREA_MODE BASEMENTAREA_MEDI 0.670219
3068 ELEVATORS_MEDI FLOORSMAX_MODE 0.669194
2785 BASEMENTAREA_MEDI APARTMENTS_MODE 0.668579
1737 APARTMENTS_MODE BASEMENTAREA_AVG 0.665837
2717 APARTMENTS_MEDI BASEMENTAREA_MODE 0.664087
1805 BASEMENTAREA_MODE APARTMENTS_AVG 0.661715
2238 FLOORSMAX_MODE ELEVATORS_MODE 0.660813
2441 LIVINGAPARTMENTS_MODE BASEMENTAREA_MODE 0.657202
2088 ELEVATORS_MODE FLOORSMAX_AVG 0.655993
3204 FLOORSMAX_MEDI ELEVATORS_MODE 0.655301
3421 LIVINGAPARTMENTS_MEDI BASEMENTAREA_MEDI 0.654280
2795 BASEMENTAREA_MEDI LIVINGAPARTMENTS_MODE 0.654117
2165 ENTRANCES_MODE BASEMENTAREA_MODE 0.653678
1811 BASEMENTAREA_MODE ENTRANCES_AVG 0.653412
2777 BASEMENTAREA_MEDI ENTRANCES_AVG 0.652667
3131 ENTRANCES_MEDI BASEMENTAREA_MODE 0.652286
3393 LIVINGAPARTMENTS_MEDI BASEMENTAREA_AVG 0.651640
3145 ENTRANCES_MEDI BASEMENTAREA_MEDI 0.651115
1185 ENTRANCES_AVG BASEMENTAREA_AVG 0.650657
3683 TOTALAREA_MODE BASEMENTAREA_MODE 0.650591
2427 LIVINGAPARTMENTS_MODE BASEMENTAREA_AVG 0.649819
2781 BASEMENTAREA_MEDI LIVINGAPARTMENTS_AVG 0.649458
1461 LIVINGAPARTMENTS_AVG BASEMENTAREA_AVG 0.649358
3117 ENTRANCES_MEDI BASEMENTAREA_AVG 0.646530
3675 TOTALAREA_MODE FLOORSMAX_AVG 0.633699
3407 LIVINGAPARTMENTS_MEDI BASEMENTAREA_MODE 0.632946
3703 TOTALAREA_MODE FLOORSMAX_MEDI 0.631241
2791 BASEMENTAREA_MEDI ENTRANCES_MODE 0.630842
1536 LIVINGAREA_AVG FLOORSMAX_AVG 0.629838
1815 BASEMENTAREA_MODE LIVINGAPARTMENTS_AVG 0.628019
3196 FLOORSMAX_MEDI LIVINGAREA_AVG 0.627858
3689 TOTALAREA_MODE FLOORSMAX_MODE 0.626663
3468 LIVINGAREA_MEDI FLOORSMAX_AVG 0.626568
2151 ENTRANCES_MODE BASEMENTAREA_AVG 0.625835
3496 LIVINGAREA_MEDI FLOORSMAX_MEDI 0.625791
2230 FLOORSMAX_MODE LIVINGAREA_AVG 0.625579
3482 LIVINGAREA_MEDI FLOORSMAX_MODE 0.623980
2501 LIVINGAREA_MODE ENTRANCES_AVG 0.623472
2515 LIVINGAREA_MODE ENTRANCES_MODE 0.623182
3141 ENTRANCES_MEDI LIVINGAREA_MODE 0.622801
3467 LIVINGAREA_MEDI ENTRANCES_AVG 0.619999
3495 LIVINGAREA_MEDI ENTRANCES_MEDI 0.619562
1253 FLOORSMAX_AVG APARTMENTS_AVG 0.619371
1535 LIVINGAREA_AVG ENTRANCES_AVG 0.619216
3185 FLOORSMAX_MEDI APARTMENTS_AVG 0.617271
2709 APARTMENTS_MEDI FLOORSMAX_AVG 0.615685
3127 ENTRANCES_MEDI LIVINGAREA_AVG 0.615367
2219 FLOORSMAX_MODE APARTMENTS_AVG 0.615358
2164 ENTRANCES_MODE APARTMENTS_MODE 0.615096
3213 FLOORSMAX_MEDI APARTMENTS_MEDI 0.614799
4619 DAYS_EMPLOYED DAYS_BIRTH 0.614650
2723 APARTMENTS_MEDI FLOORSMAX_MODE 0.613281
3130 ENTRANCES_MEDI APARTMENTS_MODE 0.611987
1742 APARTMENTS_MODE ENTRANCES_AVG 0.611935
2708 APARTMENTS_MEDI ENTRANCES_AVG 0.610877
1184 ENTRANCES_AVG APARTMENTS_AVG 0.610653
3144 ENTRANCES_MEDI APARTMENTS_MEDI 0.610636
3116 ENTRANCES_MEDI APARTMENTS_AVG 0.606969
2516 LIVINGAREA_MODE FLOORSMAX_MODE 0.605667
4493 DAYS_BIRTH EXT_SOURCE_1 0.601112

Conclusions on the correlation between variables

Some pairs of variables show extremely high correlations. In many cases both variables describe the same attribute and differ only in the summary statistic reported, for example YEARS_BUILD_MEDI and YEARS_BUILD_AVG. Redundant variables in a predictive model hurt its stability and interpretability, so they need to be identified and removed from the model.
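
One way to flag such redundant variables is sketched below, using toy data rather than the notebook's DataFrame. The helper name `redundant_columns`, the 0.95 threshold, and the toy values are assumptions made for illustration; the notebook does not state which rule it uses to discard variables.

```python
import numpy as np
import pandas as pd


def redundant_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (hypothetical values): the second column is a near copy of the first.
rng = np.random.default_rng(42)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "YEARS_BUILD_AVG": base,
    "YEARS_BUILD_MEDI": base + rng.normal(scale=0.01, size=200),
    "AMT_INCOME_TOTAL": rng.normal(size=200),
})

to_drop = redundant_columns(toy, threshold=0.95)
print(to_drop)
toy_reduced = toy.drop(columns=to_drop)
```

With this rule the first variable of each highly correlated pair is kept and the later one is dropped; which member of a pair to keep is a modelling decision that the sketch does not try to settle.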

Likewise, the variables related to income, credit, and employment situation include pairs with a direct, roughly proportional relationship, such as AMT_ANNUITY and AMT_CREDIT (correlation 0.77 in the table above): as the requested credit amount increases, the annuity tends to increase as well. As in the previous section, where values were placed within a confidence interval, this helps us identify more complex patterns in the applicants' profiles.

Handling null values (continuous variables)

In [40]:
lista_var_con
Out[40]:
['AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_REGISTRATION',
 'OWN_CAR_AGE',
 'CNT_FAM_MEMBERS',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMAX_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMAX_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'TOTALAREA_MODE',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_ID_PUBLISH',
 'HOUR_APPR_PROCESS_START']
In [41]:
f_aux.get_percent_null_values_target(pd_loan = df_train, list_var_continuous = lista_var_con, target = 'TARGET')
Out[41]:
Category_0 variable sum_null_values porcentaje_sum_null_values Category_1
0 1.000000 AMT_ANNUITY 10 0.000041 NaN
1 0.923077 AMT_GOODS_PRICE 221 0.000898 0.076923
2 0.915163 OWN_CAR_AGE 162418 0.660214 0.084837
3 1.000000 CNT_FAM_MEMBERS 2 0.000008 NaN
4 0.914752 EXT_SOURCE_1 138595 0.563376 0.085248
5 0.922787 EXT_SOURCE_2 531 0.002158 0.077213
6 0.907223 EXT_SOURCE_3 48805 0.198388 0.092777
7 0.908612 APARTMENTS_AVG 124732 0.507024 0.091388
8 0.911054 BASEMENTAREA_AVG 143829 0.584652 0.088946
9 0.908069 YEARS_BEGINEXPLUATATION_AVG 119949 0.487582 0.091931
10 0.913381 YEARS_BUILD_AVG 163543 0.664787 0.086619
11 0.914441 COMMONAREA_AVG 171811 0.698396 0.085559
12 0.909309 ELEVATORS_AVG 131017 0.532572 0.090691
13 0.908366 ENTRANCES_AVG 123775 0.503134 0.091634
14 0.908191 FLOORSMAX_AVG 122297 0.497126 0.091809
15 0.913863 FLOORSMIN_AVG 166921 0.678519 0.086137
16 0.912066 LANDAREA_AVG 145985 0.593416 0.087934
17 0.913972 LIVINGAPARTMENTS_AVG 168119 0.683388 0.086028
18 0.908725 LIVINGAREA_AVG 123462 0.501862 0.091275
19 0.914273 NONLIVINGAPARTMENTS_AVG 170729 0.693998 0.085727
20 0.909750 NONLIVINGAREA_AVG 135624 0.551299 0.090250
21 0.908612 APARTMENTS_MODE 124732 0.507024 0.091388
22 0.911054 BASEMENTAREA_MODE 143829 0.584652 0.088946
23 0.908069 YEARS_BEGINEXPLUATATION_MODE 119949 0.487582 0.091931
24 0.913381 YEARS_BUILD_MODE 163543 0.664787 0.086619
25 0.914441 COMMONAREA_MODE 171811 0.698396 0.085559
26 0.909309 ELEVATORS_MODE 131017 0.532572 0.090691
27 0.908366 ENTRANCES_MODE 123775 0.503134 0.091634
28 0.908191 FLOORSMAX_MODE 122297 0.497126 0.091809
29 0.913863 FLOORSMIN_MODE 166921 0.678519 0.086137
30 0.912066 LANDAREA_MODE 145985 0.593416 0.087934
31 0.913972 LIVINGAPARTMENTS_MODE 168119 0.683388 0.086028
32 0.908725 LIVINGAREA_MODE 123462 0.501862 0.091275
33 0.914273 NONLIVINGAPARTMENTS_MODE 170729 0.693998 0.085727
34 0.909750 NONLIVINGAREA_MODE 135624 0.551299 0.090250
35 0.908612 APARTMENTS_MEDI 124732 0.507024 0.091388
36 0.911054 BASEMENTAREA_MEDI 143829 0.584652 0.088946
37 0.908069 YEARS_BEGINEXPLUATATION_MEDI 119949 0.487582 0.091931
38 0.913381 YEARS_BUILD_MEDI 163543 0.664787 0.086619
39 0.914441 COMMONAREA_MEDI 171811 0.698396 0.085559
40 0.909309 ELEVATORS_MEDI 131017 0.532572 0.090691
41 0.908366 ENTRANCES_MEDI 123775 0.503134 0.091634
42 0.908191 FLOORSMAX_MEDI 122297 0.497126 0.091809
43 0.913863 FLOORSMIN_MEDI 166921 0.678519 0.086137
44 0.912066 LANDAREA_MEDI 145985 0.593416 0.087934
45 0.913972 LIVINGAPARTMENTS_MEDI 168119 0.683388 0.086028
46 0.908725 LIVINGAREA_MEDI 123462 0.501862 0.091275
47 0.914273 NONLIVINGAPARTMENTS_MEDI 170729 0.693998 0.085727
48 0.909750 NONLIVINGAREA_MEDI 135624 0.551299 0.090250
49 0.907756 TOTALAREA_MODE 118707 0.482533 0.092244
50 0.960543 OBS_30_CNT_SOCIAL_CIRCLE 811 0.003297 0.039457
51 0.960543 DEF_30_CNT_SOCIAL_CIRCLE 811 0.003297 0.039457
52 0.960543 OBS_60_CNT_SOCIAL_CIRCLE 811 0.003297 0.039457
53 0.960543 DEF_60_CNT_SOCIAL_CIRCLE 811 0.003297 0.039457
54 1.000000 DAYS_LAST_PHONE_CHANGE 1 0.000004 NaN
55 0.896613 AMT_REQ_CREDIT_BUREAU_HOUR 33244 0.135134 0.103387
56 0.896613 AMT_REQ_CREDIT_BUREAU_DAY 33244 0.135134 0.103387
57 0.896613 AMT_REQ_CREDIT_BUREAU_WEEK 33244 0.135134 0.103387
58 0.896613 AMT_REQ_CREDIT_BUREAU_MON 33244 0.135134 0.103387
59 0.896613 AMT_REQ_CREDIT_BUREAU_QRT 33244 0.135134 0.103387
60 0.896613 AMT_REQ_CREDIT_BUREAU_YEAR 33244 0.135134 0.103387

Conclusions on the percentage of null values

The analysis above makes it possible to sort the variables into two groups: variables to impute and variables to drop. This categorization improves the quality of the dataset for the predictive models. However, it is also important to consider what the null values mean. Here, a null may indicate that the client did not provide certain documents or required information, so, depending on the variable, it can be an indicator of higher risk.
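
If missingness itself may carry information about risk, one option is to record it in an indicator column before imputing. The sketch below is an illustration on toy data: the helper `add_missing_flags` and the `_is_null` suffix are hypothetical names, and this step is not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def add_missing_flags(df: pd.DataFrame, columns: list, suffix: str = "_is_null") -> pd.DataFrame:
    """Return a copy of df with one boolean flag column per listed column."""
    out = df.copy()
    for col in columns:
        out[col + suffix] = out[col].isnull()
    return out


# Toy data (hypothetical values).
toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.5, np.nan, 0.3, np.nan],
    "OWN_CAR_AGE": [np.nan, 4.0, np.nan, 10.0],
    "TARGET": [0, 1, 0, 1],
})

flagged = add_missing_flags(toy, ["EXT_SOURCE_1", "OWN_CAR_AGE"])

# Share of TARGET == 1 among rows where the value is present vs. missing.
print(flagged.groupby("EXT_SOURCE_1_is_null")["TARGET"].mean())
```

Comparing the target rate between the two groups gives a quick indication of whether a null in a given variable is associated with default.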

Imputing null values (continuous variables)

Next, two lists were built to impute the missing values in the dataset: one for mean imputation and one for median imputation.

The choice of method was based on the percentage of missing values in each variable. Variables with 30% null values or fewer were imputed with the mean, on the assumption that they are distributed fairly evenly and that imputation would not significantly affect the relationships between variables. Mean imputation is appropriate when the data contain no substantial outliers and follow a symmetric or normal distribution.

Variables with more than 30% null values were imputed with the median, which is more robust to outliers and skewed distributions. Imputing these variables with the mean could distort the analysis because of outliers or a skewed distribution. Note that in the list built below the mean group contains only variables with under 1% nulls; EXT_SOURCE_3 (about 20% nulls) and the AMT_REQ_CREDIT_BUREAU_* variables (about 13.5%) are assigned to the median group.
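
The 30% rule described above could be derived from the data instead of being written by hand. The following is a minimal sketch on toy data; the helper `split_by_null_share` is a hypothetical name, and applying it to the real training set would not necessarily reproduce the hand-written lists used in this notebook.

```python
import numpy as np
import pandas as pd


def split_by_null_share(df: pd.DataFrame, columns: list, threshold: float = 0.30):
    """Split columns into (mean-imputed, median-imputed) by their share of nulls."""
    null_share = df[columns].isnull().mean()
    mean_cols = null_share[null_share <= threshold].index.tolist()
    median_cols = null_share[null_share > threshold].index.tolist()
    return mean_cols, median_cols


# Toy data (hypothetical values): 10% nulls in one column, 60% in the other.
toy = pd.DataFrame({
    "AMT_ANNUITY": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "EXT_SOURCE_1": [np.nan] * 6 + [0.1, 0.2, 0.3, 0.4],
})

mean_cols, median_cols = split_by_null_share(toy, ["AMT_ANNUITY", "EXT_SOURCE_1"])
print("mean:", mean_cols)
print("median:", median_cols)
```

Because the argument for the mean rests on symmetry, it may also be worth inspecting `Series.skew()` for each candidate before committing to the mean.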

In [42]:
lista_imputar_media = []
lista_imputar_mediana = []

for variable in df_train[lista_var_con]:
    if variable in ['AMT_ANNUITY', 'AMT_GOODS_PRICE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_2', 'OBS_30_CNT_SOCIAL_CIRCLE', 
    'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE']:
        lista_imputar_media.append(variable)
    else:
        lista_imputar_mediana.append(variable)
        
print("Lista Imputar Media:", lista_imputar_media)
print("Lista Imputar Mediana:", lista_imputar_mediana)
Lista Imputar Media: ['AMT_ANNUITY', 'AMT_GOODS_PRICE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_2', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE']
Lista Imputar Mediana: ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'EXT_SOURCE_1', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']

In the next step, we make a copy of the train and test sets to preserve the integrity of the original data and make it easier to manage across the stages of the analysis.

In [43]:
copia_df_train = df_train.copy()
copia_df_test = df_test.copy()
In [44]:
# Impute with the mean
copia_df_train[lista_imputar_media] = copia_df_train[lista_imputar_media].apply(lambda x: x.fillna(x.mean()))
copia_df_test[lista_imputar_media] = copia_df_test[lista_imputar_media].apply(lambda x: x.fillna(x.mean()))

# Impute with the median
copia_df_train[lista_imputar_mediana] = copia_df_train[lista_imputar_mediana].apply(lambda x: x.fillna(x.median()))
copia_df_test[lista_imputar_mediana] = copia_df_test[lista_imputar_mediana].apply(lambda x: x.fillna(x.median()))
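
The cell above fills each set with its own statistics. A common alternative, shown here only as a sketch on toy data, is to compute the means and medians on the training set and reuse them for the test set, so that no information from the test set enters the preprocessing. The toy frames and column lists are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
train = pd.DataFrame({
    "AMT_ANNUITY": [10.0, 20.0, np.nan, 30.0],
    "EXT_SOURCE_1": [0.2, np.nan, 0.4, 0.9],
})
test = pd.DataFrame({
    "AMT_ANNUITY": [np.nan, 100.0],
    "EXT_SOURCE_1": [np.nan, 0.1],
})

mean_cols = ["AMT_ANNUITY"]
median_cols = ["EXT_SOURCE_1"]

# Statistics are computed on the training set only.
train_means = train[mean_cols].mean()
train_medians = train[median_cols].median()

train_imp = train.copy()
test_imp = test.copy()

# fillna with a Series fills each column with the value stored under its label.
train_imp[mean_cols] = train_imp[mean_cols].fillna(train_means)
test_imp[mean_cols] = test_imp[mean_cols].fillna(train_means)
train_imp[median_cols] = train_imp[median_cols].fillna(train_medians)
test_imp[median_cols] = test_imp[median_cols].fillna(train_medians)

print(test_imp)
```

In this toy case the missing test values become 20.0 (the training mean) and 0.4 (the training median), regardless of the values present in the test set.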

We verify that this type of variable no longer contains null values.

In [45]:
# Count null values only for the variables in lista_var_con
nulos_train_con = copia_df_train[lista_var_con].isnull().sum()
nulos_test_con = copia_df_test[lista_var_con].isnull().sum()

# Print the null counts per variable for both DataFrames
print("Valores nulos por variable (copia_df_train) :")
print(nulos_train_con)

print("\nValores nulos por variable (copia_df_test) :")
print(nulos_test_con)
Valores nulos por variable (copia_df_train) :
AMT_INCOME_TOTAL                0
AMT_CREDIT                      0
AMT_ANNUITY                     0
AMT_GOODS_PRICE                 0
REGION_POPULATION_RELATIVE      0
DAYS_REGISTRATION               0
OWN_CAR_AGE                     0
CNT_FAM_MEMBERS                 0
EXT_SOURCE_1                    0
EXT_SOURCE_2                    0
EXT_SOURCE_3                    0
APARTMENTS_AVG                  0
BASEMENTAREA_AVG                0
YEARS_BEGINEXPLUATATION_AVG     0
YEARS_BUILD_AVG                 0
COMMONAREA_AVG                  0
ELEVATORS_AVG                   0
ENTRANCES_AVG                   0
FLOORSMAX_AVG                   0
FLOORSMIN_AVG                   0
LANDAREA_AVG                    0
LIVINGAPARTMENTS_AVG            0
LIVINGAREA_AVG                  0
NONLIVINGAPARTMENTS_AVG         0
NONLIVINGAREA_AVG               0
APARTMENTS_MODE                 0
BASEMENTAREA_MODE               0
YEARS_BEGINEXPLUATATION_MODE    0
YEARS_BUILD_MODE                0
COMMONAREA_MODE                 0
ELEVATORS_MODE                  0
ENTRANCES_MODE                  0
FLOORSMAX_MODE                  0
FLOORSMIN_MODE                  0
LANDAREA_MODE                   0
LIVINGAPARTMENTS_MODE           0
LIVINGAREA_MODE                 0
NONLIVINGAPARTMENTS_MODE        0
NONLIVINGAREA_MODE              0
APARTMENTS_MEDI                 0
BASEMENTAREA_MEDI               0
YEARS_BEGINEXPLUATATION_MEDI    0
YEARS_BUILD_MEDI                0
COMMONAREA_MEDI                 0
ELEVATORS_MEDI                  0
ENTRANCES_MEDI                  0
FLOORSMAX_MEDI                  0
FLOORSMIN_MEDI                  0
LANDAREA_MEDI                   0
LIVINGAPARTMENTS_MEDI           0
LIVINGAREA_MEDI                 0
NONLIVINGAPARTMENTS_MEDI        0
NONLIVINGAREA_MEDI              0
TOTALAREA_MODE                  0
OBS_30_CNT_SOCIAL_CIRCLE        0
DEF_30_CNT_SOCIAL_CIRCLE        0
OBS_60_CNT_SOCIAL_CIRCLE        0
DEF_60_CNT_SOCIAL_CIRCLE        0
DAYS_LAST_PHONE_CHANGE          0
AMT_REQ_CREDIT_BUREAU_HOUR      0
AMT_REQ_CREDIT_BUREAU_DAY       0
AMT_REQ_CREDIT_BUREAU_WEEK      0
AMT_REQ_CREDIT_BUREAU_MON       0
AMT_REQ_CREDIT_BUREAU_QRT       0
AMT_REQ_CREDIT_BUREAU_YEAR      0
DAYS_BIRTH                      0
DAYS_EMPLOYED                   0
DAYS_ID_PUBLISH                 0
HOUR_APPR_PROCESS_START         0
dtype: int64

Valores nulos por variable (copia_df_test) :
AMT_INCOME_TOTAL                0
AMT_CREDIT                      0
AMT_ANNUITY                     0
AMT_GOODS_PRICE                 0
REGION_POPULATION_RELATIVE      0
DAYS_REGISTRATION               0
OWN_CAR_AGE                     0
CNT_FAM_MEMBERS                 0
EXT_SOURCE_1                    0
EXT_SOURCE_2                    0
EXT_SOURCE_3                    0
APARTMENTS_AVG                  0
BASEMENTAREA_AVG                0
YEARS_BEGINEXPLUATATION_AVG     0
YEARS_BUILD_AVG                 0
COMMONAREA_AVG                  0
ELEVATORS_AVG                   0
ENTRANCES_AVG                   0
FLOORSMAX_AVG                   0
FLOORSMIN_AVG                   0
LANDAREA_AVG                    0
LIVINGAPARTMENTS_AVG            0
LIVINGAREA_AVG                  0
NONLIVINGAPARTMENTS_AVG         0
NONLIVINGAREA_AVG               0
APARTMENTS_MODE                 0
BASEMENTAREA_MODE               0
YEARS_BEGINEXPLUATATION_MODE    0
YEARS_BUILD_MODE                0
COMMONAREA_MODE                 0
ELEVATORS_MODE                  0
ENTRANCES_MODE                  0
FLOORSMAX_MODE                  0
FLOORSMIN_MODE                  0
LANDAREA_MODE                   0
LIVINGAPARTMENTS_MODE           0
LIVINGAREA_MODE                 0
NONLIVINGAPARTMENTS_MODE        0
NONLIVINGAREA_MODE              0
APARTMENTS_MEDI                 0
BASEMENTAREA_MEDI               0
YEARS_BEGINEXPLUATATION_MEDI    0
YEARS_BUILD_MEDI                0
COMMONAREA_MEDI                 0
ELEVATORS_MEDI                  0
ENTRANCES_MEDI                  0
FLOORSMAX_MEDI                  0
FLOORSMIN_MEDI                  0
LANDAREA_MEDI                   0
LIVINGAPARTMENTS_MEDI           0
LIVINGAREA_MEDI                 0
NONLIVINGAPARTMENTS_MEDI        0
NONLIVINGAREA_MEDI              0
TOTALAREA_MODE                  0
OBS_30_CNT_SOCIAL_CIRCLE        0
DEF_30_CNT_SOCIAL_CIRCLE        0
OBS_60_CNT_SOCIAL_CIRCLE        0
DEF_60_CNT_SOCIAL_CIRCLE        0
DAYS_LAST_PHONE_CHANGE          0
AMT_REQ_CREDIT_BUREAU_HOUR      0
AMT_REQ_CREDIT_BUREAU_DAY       0
AMT_REQ_CREDIT_BUREAU_WEEK      0
AMT_REQ_CREDIT_BUREAU_MON       0
AMT_REQ_CREDIT_BUREAU_QRT       0
AMT_REQ_CREDIT_BUREAU_YEAR      0
DAYS_BIRTH                      0
DAYS_EMPLOYED                   0
DAYS_ID_PUBLISH                 0
HOUR_APPR_PROCESS_START         0
dtype: int64

Handling null values (categorical and boolean variables)

In [46]:
lista_var_cat
Out[46]:
['NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'OCCUPATION_TYPE',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'ORGANIZATION_TYPE',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'WALLSMATERIAL_MODE',
 'CNT_CHILDREN',
 'NWEEKDAY_PROCESS_START']
In [47]:
lista_var_bool
Out[47]:
['TARGET',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'EMERGENCYSTATE_MODE',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21']

For each variable-type list, we loop over its columns and report those with null values. Note that an earlier inspection showed that the boolean variables contain no null values, which is why the boolean loop prints a message for that case.

In [48]:
col_cat = df_train.select_dtypes(include=['category']).columns.tolist()

for col in col_cat:
    valores_nulos = df_train[col].isnull().sum()
    tipo_variable = df_train[col].dtype
    valores_unicos = df_train[col].unique()
    if valores_nulos > 0:
        print(f"Variable: {col}")
        print(f"  - Valores faltantes: {valores_nulos}")
        print(f"  - Tipo de variable: {tipo_variable}")
        print(f"  - Valores únicos: {valores_unicos}")
        print("-" * 90)
Variable: NAME_TYPE_SUITE
  - Valores faltantes: 1029
  - Tipo de variable: category
  - Valores únicos: ['Unaccompanied', 'Spouse, partner', 'Family', 'Other_B', NaN, 'Children', 'Group of people', 'Other_A']
Categories (7, object): ['Children', 'Family', 'Group of people', 'Other_A', 'Other_B', 'Spouse, partner', 'Unaccompanied']
------------------------------------------------------------------------------------------
Variable: OCCUPATION_TYPE
  - Valores faltantes: 76940
  - Tipo de variable: category
  - Valores únicos: ['Laborers', 'Drivers', 'Accountants', NaN, 'Sales staff', ..., 'IT staff', 'Realty agents', 'HR staff', 'Secretaries', 'Cleaning staff']
Length: 19
Categories (18, object): ['Accountants', 'Cleaning staff', 'Cooking staff', 'Core staff', ..., 'Sales staff', 'Secretaries', 'Security staff', 'Waiters/barmen staff']
------------------------------------------------------------------------------------------
Variable: FONDKAPREMONT_MODE
  - Valores faltantes: 168215
  - Tipo de variable: category
  - Valores únicos: ['reg oper account', NaN, 'reg oper spec account', 'org spec account', 'not specified']
Categories (4, object): ['not specified', 'org spec account', 'reg oper account', 'reg oper spec account']
------------------------------------------------------------------------------------------
Variable: HOUSETYPE_MODE
  - Valores faltantes: 123328
  - Tipo de variable: category
  - Valores únicos: ['block of flats', NaN, 'specific housing', 'terraced house']
Categories (3, object): ['block of flats', 'specific housing', 'terraced house']
------------------------------------------------------------------------------------------
Variable: WALLSMATERIAL_MODE
  - Valores faltantes: 124975
  - Tipo de variable: category
  - Valores únicos: ['Panel', NaN, 'Block', 'Stone, brick', 'Mixed', 'Others', 'Wooden', 'Monolithic']
Categories (7, object): ['Block', 'Mixed', 'Monolithic', 'Others', 'Panel', 'Stone, brick', 'Wooden']
------------------------------------------------------------------------------------------
In [49]:
col_bool = df_train.select_dtypes(include=[bool]).columns.tolist()

# Track whether any null values are found
hay_valores_nulos = False

for col in col_bool:
    valores_nulos = df_train[col].isnull().sum()
    tipo_variable = df_train[col].dtype
    valores_unicos = df_train[col].unique()
    
    if valores_nulos > 0:
        hay_valores_nulos = True
        print(f"Variable: {col}")
        print(f"  - Valores faltantes: {valores_nulos}")
        print(f"  - Tipo de variable: {tipo_variable}")
        print(f"  - Valores únicos: {valores_unicos}")
        print("-" * 90)
    else:
        print(f"Variable: {col} - No tiene valores nulos")

# If no null values were found
if not hay_valores_nulos:
    print("Ninguna variable tiene valores nulos.")
Ninguna variable tiene valores nulos.
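
The recorded output shows only the final message and none of the per-column messages. One possible explanation, which is an assumption and not something the notebook confirms, is that `select_dtypes(include=[bool])` matched no columns, for instance because the flags are stored as integers or categories. The sketch below, on toy data, shows how the check could be run against an explicit list of names so that it does not depend on the stored dtype.

```python
import pandas as pd

# Toy data (hypothetical): one flag stored as bool, one stored as integers.
toy = pd.DataFrame({
    "FLAG_OWN_CAR": [True, False, True],
    "FLAG_EMAIL": [0, 1, 0],
})
lista_bool_toy = ["FLAG_OWN_CAR", "FLAG_EMAIL"]

# Selecting by dtype misses the integer-coded flag.
by_dtype = toy.select_dtypes(include=[bool]).columns.tolist()
print(by_dtype)

# Checking by name covers every intended column.
null_counts = toy[lista_bool_toy].isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

In the notebook, the same check could be applied with `lista_var_bool` in place of the toy list.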

Cramér's V

The purpose of this statistic is to measure the strength of association between two categorical variables, that is, how closely related they are. It ranges from 0 to 1; the closer to 1, the stronger the association. Although it measures the relationship between variables, it does not establish causality, since it does not tell us whether one variable causes the other.
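
For reference, the textbook form of the statistic is V = sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the chi-squared statistic of the contingency table, n is the total count, and r and c are the numbers of rows and columns. The sketch below implements that plain form; the function name `cramers_v_plain` is hypothetical, and the implementation of `f_aux.cramers_v` is not shown in this notebook, so it may differ (for example by applying a bias correction).

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v_plain(table) -> float:
    """Uncorrected Cramér's V for a contingency table of counts."""
    table = np.asarray(table, dtype=float)
    # correction=False disables the Yates continuity correction for 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Toy tables (hypothetical counts).
independent = [[10, 20], [30, 60]]   # rows are proportional: no association
perfect = [[50, 0], [0, 50]]         # each row maps to exactly one column

print(cramers_v_plain(independent))
print(cramers_v_plain(perfect))
```

The two toy tables sit at the ends of the range: proportional rows give a value of 0 and a one-to-one mapping gives a value of 1.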

In [55]:
for variable in lista_var_cat:
    print('-'*90)
    print(f'Matriz de confusión {variable} con respecto a TARGET:')
    confusion_matriz = pd.crosstab(df_train['TARGET'], df_train[variable])
    print(confusion_matriz)
    valor_cramer = f_aux.cramers_v(confusion_matrix = confusion_matriz.values)
    print('Valor de Cramers:', valor_cramer )
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NAME_CONTRACT_TYPE  Cash loans  Revolving loans
TARGET                                         
0                       203988            22160
1                        18572             1288
Valor de Cramers: 0.030647843080174268
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
CODE_GENDER       F      M  XNA
TARGET                         
0            150553  75593    2
1             11334   8526    0
Valor de Cramers: 0.05451190495295015
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NAME_TYPE_SUITE  Children  Family  Group of people  Other_A  Other_B  Spouse, partner  Unaccompanied
TARGET                                                                                              
0                    2479   29690              202      636     1293             8354         182522
1                     194    2390               19       61      139              714          16286
Valor de Cramers: 0.00969475405832943
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NAME_INCOME_TYPE  Businessman  Commercial associate  Maternity leave  Pensioner  State servant  Student  Unemployed  Working
TARGET                                                                                                                      
0                           9                 52925                3      41769          16494       11          12   114925
1                           0                  4348                0       2365           1024        0           5    12118
Valor de Cramers: 0.06202037950610154
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NAME_EDUCATION_TYPE  Academic degree  Higher education  Incomplete higher  Lower secondary  Secondary / secondary special
TARGET                                                                                                                   
0                                122             56576               7574             2690                         159186
1                                  2              3226                715              335                          15582
Valor de Cramers: 0.056615705136302014
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NAME_FAMILY_STATUS  Civil marriage  Married  Separated  Single / not married  Unknown  Widow
TARGET                                                                                      
0                            21427   145313      14575                 32743        2  12088
1                             2385    11840       1265                  3616        0    754
Valor de Cramers: 0.04198246008049476
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NAME_HOUSING_TYPE  Co-op apartment  House / apartment  Municipal apartment  Office apartment  Rented apartment  With parents
TARGET                                                                                                                      
0                              804             201363                 8158              1961              3395         10467
1                               78              17020                  758               136               478          1390
Valor de Cramers: 0.03695484251026577
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
OCCUPATION_TYPE  Accountants  Cleaning staff  Cooking staff  Core staff  Drivers  HR staff  High skill tech staff  IT staff  Laborers  Low-skill Laborers  Managers  Medicine staff  Private service staff  Realty agents  Sales staff  Secretaries  Security staff  Waiters/barmen staff
TARGET                                                                                                                                                                                                                                                                                   
0                       7517            3384           4266       20683    13261       412                   8522       398     39602                1389     16032            6391                   1956            546        23148          972            4798                   926
1                        365             357            501        1410     1683        31                    571        26      4657                 307      1064             480                    138             48         2458           75             574                   120
Valor de Cramers: 0.08109692091074842
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
REGION_RATING_CLIENT      1       2      3
TARGET                                    
0                     24506  167318  34324
1                      1249   14313   4298
Valor de Cramers: 0.05890284619889794
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
REGION_RATING_CLIENT_W_CITY      1       2      3
TARGET                                           
0                            25993  169111  31044
1                             1330   14511   4019
Valor de Cramers: 0.06135786316089503
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
ORGANIZATION_TYPE  Advertising  Agriculture  Bank  Business Entity Type 1  Business Entity Type 2  Business Entity Type 3  Cleaning  Construction  Culture  Electricity  Emergency  Government  Hotel  Housing  Industry: type 1  Industry: type 10  Industry: type 11  Industry: type 12  Industry: type 13  Industry: type 2  Industry: type 3  Industry: type 4  Industry: type 5  Industry: type 6  Industry: type 7  Industry: type 8  Industry: type 9  Insurance  Kindergarten  Legal Services  Medicine  Military  Mobile  Other  Police  Postal  Realtor  Religion  Restaurant  School  Security  Security Ministries  Self-employed  Services  Telecom  Trade: type 1  Trade: type 2  Trade: type 3  Trade: type 4  Trade: type 5  Trade: type 6  Trade: type 7  Transport: type 1  Transport: type 2  Transport: type 3  Transport: type 4  University    XNA
TARGET                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
0                          302         1762  1890                    4432                    7703                   49388       188          4754      278          701        426        7768    733     2184               751                 79               1988                281                 44               344              2364               649               435                82               964                16              2539        449          5156             229      8379      2009     229  12310    1769    1598      287        65        1274    6760      2333                 1517          27450      1175      415            247           1419           2514             50             39            472           5673                152               1630                812               3919         999  41773
1                           29          213   110                     388                     728                    5048        24           645       18           56         32         573     50      187                95                  6                181                 12                  7                31               286                67                31                 5                81                 3               181         27           396              21       606       104      21    997      92     140       37         4         168     419       274                   81           3122        80       37             27            104            286              2              1             20            591                  8                139                152                392          55   2370
Valor de Cramers: 0.07181461273694796
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FONDKAPREMONT_MODE  not specified  org spec account  reg oper account  reg oper spec account
TARGET                                                                                      
0                            4228              4218             54913                   9038
1                             346               277              4133                    640
Valor de Cramers: 0.008709666149370406
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
HOUSETYPE_MODE  block of flats  specific housing  terraced house
TARGET                                                          
0                       112086              1091             912
1                         8392               118              81
Valor de Cramers: 0.010835039064936313
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
WALLSMATERIAL_MODE  Block  Mixed  Monolithic  Others  Panel  Stone, brick  Wooden
TARGET                                                                           
0                    6947   1667        1376    1203  49487         48023    3864
1                     526    132          69     105   3367          3839     428
Valor de Cramers: 0.030261061414271043
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
CNT_CHILDREN       0      1      2     3    4   5   6  7  8  9  10  11  12  14  19
TARGET                                                                            
0             158996  44679  19444  2657  282  63  13  7  2  0   2   0   1   1   1
1              13236   4427   1860   279   43   6   6  0  0  2   0   1   0   0   0
Valor de Cramers: 0.025614997270705864
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NWEEKDAY_PROCESS_START      1      2      3      4      5      6      7
TARGET                                                                 
0                       37449  39585  38081  37219  36939  24975  11900
1                        3141   3582   3403   3242   3302   2129   1061
Valor de Cramers: 0.0053830022738895695
In [51]:
# Contingency table and Cramér's V of each boolean variable against TARGET.
# lista_var_bool includes TARGET itself, so its own table is trivial.
for variable in lista_var_bool:
    print('-'*90)
    print(f'Matriz de confusión {variable} con respecto a TARGET:')
    confusion_matriz = pd.crosstab(df_train['TARGET'], df_train[variable])
    print(confusion_matriz)
    valor_cramer = f_aux.cramers_v(confusion_matrix=confusion_matriz.values)
    print('Valor de Cramers:', valor_cramer)
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
TARGET       0      1
TARGET               
0       226148      0
1            0  19860
Valor de Cramers: 0.9999726127135284
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_OWN_CAR       0      1
TARGET                     
0             148634  77514
1              13779   6081
Valor de Cramers: 0.020917624000671178
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_OWN_REALTY      0       1
TARGET                        
0                69058  157090
1                 6260   13600
Valor de Cramers: 0.005438185035782544
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_MOBIL  0       1
TARGET               
0           1  226147
1           0   19860
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_EMP_PHONE      0       1
TARGET                       
0               41783  184365
1                2371   17489
Valor de Cramers: 0.04634411452463542
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_WORK_PHONE       0      1
TARGET                        
0                181749  44399
1                 15136   4724
Valor de Cramers: 0.02821564665685336
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_CONT_MOBILE    0       1
TARGET                       
0                 420  225728
1                  42   19818
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_PHONE       0      1
TARGET                   
0           161896  64252
1            14998   4862
Valor de Cramers: 0.023718445281449046
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_EMAIL       0      1
TARGET                   
0           213309  12839
1            18742   1118
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
REG_REGION_NOT_LIVE_REGION       0     1
TARGET                                  
0                           222790  3358
1                            19523   337
Valor de Cramers: 0.004231264273072046
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
REG_REGION_NOT_WORK_REGION       0      1
TARGET                                   
0                           214679  11469
1                            18745   1115
Valor de Cramers: 0.0063669493983877
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
LIVE_REGION_NOT_WORK_REGION       0     1
TARGET                                   
0                            216872  9276
1                             19007   853
Valor de Cramers: 0.0016623195533143134
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
REG_CITY_NOT_LIVE_CITY       0      1
TARGET                               
0                       209352  16796
1                        17493   2367
Valor de Cramers: 0.045581251151933795
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
REG_CITY_NOT_WORK_CITY       0      1
TARGET                               
0                       175357  50791
1                        13812   6048
Valor de Cramers: 0.051608603270209136
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
LIVE_CITY_NOT_WORK_CITY       0      1
TARGET                                
0                        186219  39929
1                         15419   4441
Valor de Cramers: 0.03325849436451175
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
EMERGENCYSTATE_MODE       0     1
TARGET                           
0                    224466  1682
1                     19685   175
Valor de Cramers: 0.0037283320336315645
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_2       0  1
TARGET                    
0                226140  8
1                 19857  3
Valor de Cramers: 0.002979057679884047
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_3      0       1
TARGET                        
0                66828  159320
1                 4404   15456
Valor de Cramers: 0.04423623442859539
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_4       0   1
TARGET                     
0                226126  22
1                 19860   0
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_5       0     1
TARGET                       
0                222727  3421
1                 19563   297
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_6       0      1
TARGET                        
0                205783  20365
1                 18646   1214
Valor de Cramers: 0.027754414576280584
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_7       0   1
TARGET                     
0                226105  43
1                 19857   3
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_8       0      1
TARGET                        
0                207584  18564
1                 18401   1459
Valor de Cramers: 0.008323539864287808
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_9       0    1
TARGET                      
0                225258  890
1                 19803   57
Valor de Cramers: 0.004097171764277719
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_10       0  1
TARGET                     
0                 226144  4
1                  19860  0
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_11       0    1
TARGET                       
0                 225257  891
1                  19800   60
Valor de Cramers: 0.0033536791216609795
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_12       0  1
TARGET                     
0                 226147  1
1                  19860  0
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_13       0    1
TARGET                       
0                 225339  809
1                  19837   23
Valor de Cramers: 0.011040517273742204
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_14       0    1
TARGET                       
0                 225449  699
1                  19833   27
Valor de Cramers: 0.008316733249016585
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_15       0    1
TARGET                       
0                 225861  287
1                  19852    8
Valor de Cramers: 0.006287931203058809
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_16       0     1
TARGET                        
0                 223807  2341
1                  19736   124
Valor de Cramers: 0.010977504954893965
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_17       0   1
TARGET                      
0                 226088  60
1                  19859   1
Valor de Cramers: 0.0025432021717690235
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_18       0     1
TARGET                        
0                 224264  1884
1                  19742   118
Valor de Cramers: 0.006871884485363943
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_19       0    1
TARGET                       
0                 226008  140
1                  19850   10
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_20       0    1
TARGET                       
0                 226036  112
1                  19848   12
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
FLAG_DOCUMENT_21       0   1
TARGET                      
0                 226081  67
1                  19848  12
Valor de Cramers: 0.0037594804049241475

Conclusion on Cramér's V

Across the results, the Cramér's V values are small, which indicates a weak association with TARGET. For FLAG_MOBIL, FLAG_CONT_MOBILE, FLAG_EMAIL, FLAG_DOCUMENT_4, FLAG_DOCUMENT_7, FLAG_DOCUMENT_10, FLAG_DOCUMENT_12, FLAG_DOCUMENT_19, FLAG_DOCUMENT_20, and several other variables, the value is at or near 0, so we can begin to treat these variables as irrelevant for modeling.

Variables with small but non-zero values should not be dismissed, because together they may have a cumulative effect when combined with other features. Variables in this group include FLAG_OWN_CAR, FLAG_PHONE, FLAG_DOCUMENT_3, REG_CITY_NOT_WORK_CITY, NAME_HOUSING_TYPE, and REG_CITY_NOT_LIVE_CITY.

There is also a group of variables with somewhat higher relevance than the previous ones: CODE_GENDER (0.0545), which indicates that gender is related to the target variable; NAME_EDUCATION_TYPE (0.0566), the education level; OCCUPATION_TYPE (0.0811), the applicant's occupation; and ORGANIZATION_TYPE (0.0718), the type of organization where the applicant works. Even so, all of these values remain below 0.1. It is reasonable that these variables contribute more than the previous ones, since they describe aspects that can be scaled.
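The values discussed above come from `f_aux.cramers_v`, whose source is not shown in this notebook. As a rough illustration only, the sketch below computes a bias-corrected Cramér's V from a contingency table with `scipy.stats.chi2_contingency`. The function name `cramers_v_sketch` and the choice of the bias-corrected variant are assumptions, not necessarily what the helper module does.

```python
import numpy as np
import scipy.stats as ss


def cramers_v_sketch(confusion_matrix: np.ndarray) -> float:
    """Bias-corrected Cramér's V for a contingency table (hypothetical helper)."""
    # Chi-squared statistic; SciPy applies Yates' correction to 2x2 tables by default.
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias correction: subtract the expected value of phi2 under independence.
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))


# Counts copied from the FLAG_OWN_CAR table printed above.
flag_own_car = np.array([[148634, 77514], [13779, 6081]])
print("FLAG_OWN_CAR:", cramers_v_sketch(flag_own_car))
```

With this correction, a table whose chi-squared statistic is smaller than its expected value under independence is clipped to exactly 0, which is one plausible reason several flags above report 0.0 rather than a tiny positive number.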

Impute null values (categorical variables)

In [52]:
# Cast to object so the new label "SIN VALOR" can be filled in, then cast back to category.
copia_df_train[lista_var_cat] = copia_df_train[lista_var_cat].astype("object").fillna("SIN VALOR").astype("category")
copia_df_test[lista_var_cat] = copia_df_test[lista_var_cat].astype("object").fillna("SIN VALOR").astype("category")
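The round trip through `object` in the cell above matters because pandas rejects filling a categorical column with a label that is not already one of its categories. A minimal sketch on a made-up frame (the rows and the name `toy` are illustrative, not taken from the dataset) shows the effect:

```python
import pandas as pd

# Hypothetical toy frame standing in for copia_df_train[lista_var_cat].
toy = pd.DataFrame({
    "OCCUPATION_TYPE": pd.Categorical(["Laborers", None, "Drivers"]),
    "HOUSETYPE_MODE": pd.Categorical([None, "block of flats", None]),
})
cat_cols = ["OCCUPATION_TYPE", "HOUSETYPE_MODE"]

filled = toy.copy()
# Same pattern as the notebook: object -> fill -> category.
filled[cat_cols] = (
    filled[cat_cols].astype("object").fillna("SIN VALOR").astype("category")
)

print(filled)
print(filled.isna().sum())
```

An alternative would be to call `Series.cat.add_categories("SIN VALOR")` on each column before `fillna`, which avoids the intermediate `object` dtype at the cost of a per-column loop.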

Save CSV

In [53]:
copia_df_train.to_csv('/Users/miguelflores/Desktop/CSV/train_df_preprocessing_missing_outlier.csv')
copia_df_test.to_csv('/Users/miguelflores/Desktop/CSV/test_df_preprocessing_missing_outlier.csv')
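CSV does not store dtypes, so the `category` columns written above come back as plain strings unless the dtype is requested on load. The sketch below illustrates this with an in-memory buffer and a made-up two-row frame; it assumes the index is still named `SK_ID_CURR`, as in the frame read at the top of the notebook.

```python
import io

import pandas as pd

# Hypothetical two-row frame; values are illustrative only.
toy = pd.DataFrame(
    {"TARGET": [0, 1], "CODE_GENDER": pd.Categorical(["F", "SIN VALOR"])},
    index=pd.Index([100002, 100003], name="SK_ID_CURR"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the index and the categorical dtype explicitly when reading back.
reloaded = pd.read_csv(
    buffer,
    index_col="SK_ID_CURR",
    dtype={"CODE_GENDER": "category"},
)
print(reloaded.dtypes)
```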